{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "63a1d705",
   "metadata": {},
   "source": [
    "# Convert Trapped Tables within PDFs to Pandas DataFrames\n",
    "\n",
    "Pandas is the most popular Python data analysis library available today and can read in data directly from a wide variety of sources, including CSVs, Excel Workbooks, JSON files, SQL databases, parquet files, and even from your clipboard. Currently, there is no direct method using pandas to read in data trapped within a PDF file. Thankfully, the [tabula-py library][1] library is available to read in these tables within a PDF as pandas DataFrames.\n",
    "\n",
    "The tabula-py library itself is a wrapper around [tabula-java][2], a command line tool for extracting trapped data within a PDF. [Get started][3] by installing it with `pip install tabula-py`.\n",
    "\n",
    "## Sample PDF - Book Sales\n",
    "\n",
    "For this tutorial, we will extract a single table within the royalty statement from [Pandas Cookbook][4], a book I published in 2017. An image of the PDF is displayed below.\n",
    "\n",
    "[1]: https://tabula-py.readthedocs.io/en/latest/index.html\n",
    "[2]: https://github.com/tabulapdf/tabula-java\n",
    "[3]: https://tabula-py.readthedocs.io/en/latest/getting_started.html\n",
    "[4]: https://www.amazon.com/Pandas-Cookbook-Scientific-Computing-Visualization/dp/1784393878"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d3028bf2",
   "metadata": {},
   "source": [
    "![](pdf_med.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2682d124",
   "metadata": {},
   "source": [
    "## Automatic table detection\n",
    "\n",
    "Tabula-py automatically detects all tables on the pages provided with its `read_pdf` function. All we have to do is provide it the file path of the PDF document and the pages to extract. The returned result is a list of DataFrames, stored here as the variable `dfs`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "798782dc",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "1"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import tabula\n",
    "dfs = tabula.read_pdf('2018Q1.pdf', pages=1)\n",
    "len(dfs)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7d05a5cf",
   "metadata": {},
   "source": [
    "Tabula has detected a single table. Let's output the DataFrame to see how well the extraction worked."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "901f2700",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Unnamed: 0</th>\n",
       "      <th>AUTHOR ROYALTY STATEMENT</th>\n",
       "      <th>Unnamed: 1</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>NaN</td>\n",
       "      <td>ROYALTY STATEMENT DATE: 01/01/18 - 31/03/18</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Pandas Cookbook</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Format</td>\n",
       "      <td>Copies   Revenue</td>\n",
       "      <td>Royalties</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Subscription 3rd Party</td>\n",
       "      <td>0   $ 0.00</td>\n",
       "      <td>$ 0.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Video</td>\n",
       "      <td>0   $ 0.00</td>\n",
       "      <td>$ 0.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Translation</td>\n",
       "      <td>0   $ 0.00</td>\n",
       "      <td>$ 0.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Course</td>\n",
       "      <td>0   $ 0.00</td>\n",
       "      <td>$ 0.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Subscription Mapt</td>\n",
       "      <td>203   $ 626.05</td>\n",
       "      <td>$ 100.16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Print Book</td>\n",
       "      <td>366   $ 12,927.57</td>\n",
       "      <td>$ 2,068.41</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Ebook</td>\n",
       "      <td>643   $ 9,131.04</td>\n",
       "      <td>$ 1,460.97</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               Unnamed: 0                     AUTHOR ROYALTY STATEMENT  \\\n",
       "0                     NaN  ROYALTY STATEMENT DATE: 01/01/18 - 31/03/18   \n",
       "1         Pandas Cookbook                                          NaN   \n",
       "2                  Format                             Copies   Revenue   \n",
       "3  Subscription 3rd Party                                   0   $ 0.00   \n",
       "4                   Video                                   0   $ 0.00   \n",
       "5             Translation                                   0   $ 0.00   \n",
       "6                  Course                                   0   $ 0.00   \n",
       "7       Subscription Mapt                               203   $ 626.05   \n",
       "8              Print Book                            366   $ 12,927.57   \n",
       "9                   Ebook                             643   $ 9,131.04   \n",
       "\n",
       "   Unnamed: 1  \n",
       "0         NaN  \n",
       "1         NaN  \n",
       "2   Royalties  \n",
       "3      $ 0.00  \n",
       "4      $ 0.00  \n",
       "5      $ 0.00  \n",
       "6      $ 0.00  \n",
       "7    $ 100.16  \n",
       "8  $ 2,068.41  \n",
       "9  $ 1,460.97  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = dfs[0]\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8bed7407",
   "metadata": {},
   "source": [
    "For this PDF, I wanted to extract the table beginning in the middle box with the four columns **Format**, **Copies**, **Revenue**, **Royalties**. Instead, Tabula extracted the two boxes above this and returned a three-column table combining Copies and Revenue. While the result isn't bad, it's not usable in the current form. We could use pandas at this point to clean up the data, but tabula-py provides us parameters to define the boundary of the entire table itself and the positioning of the columns.\n",
    "\n",
    "The `relative_area` and `relative_columns` parameters are set to `True` which allows us to provide boundaries to both the `area` and `columns` parameters as a percentage of the page, as opposed to point coordinates. In my opinion, it is easier to define a boundary using relative page percentage.\n",
    "\n",
    "The `area` column is passed a four-item list of top, left, bottom, and right. These represent the boundaries of the table relative to the page. The three-item list for `columns` represent the vertical lines separating the four columns. Let's re-read in the data and output the resulting DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "1356c774",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Format</th>\n",
       "      <th>Copies</th>\n",
       "      <th>Revenue</th>\n",
       "      <th>Royalties</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Subscription 3rd Party</td>\n",
       "      <td>0</td>\n",
       "      <td>$ 0.00</td>\n",
       "      <td>$ 0.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Video</td>\n",
       "      <td>0</td>\n",
       "      <td>$ 0.00</td>\n",
       "      <td>$ 0.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Translation</td>\n",
       "      <td>0</td>\n",
       "      <td>$ 0.00</td>\n",
       "      <td>$ 0.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Course</td>\n",
       "      <td>0</td>\n",
       "      <td>$ 0.00</td>\n",
       "      <td>$ 0.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Subscription Mapt</td>\n",
       "      <td>203</td>\n",
       "      <td>$ 626.05</td>\n",
       "      <td>$ 100.16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Print Book</td>\n",
       "      <td>366</td>\n",
       "      <td>$ 12,927.57</td>\n",
       "      <td>$ 2,068.41</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Ebook</td>\n",
       "      <td>643</td>\n",
       "      <td>$ 9,131.04</td>\n",
       "      <td>$ 1,460.97</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                   Format  Copies      Revenue   Royalties\n",
       "0  Subscription 3rd Party       0       $ 0.00      $ 0.00\n",
       "1                   Video       0       $ 0.00      $ 0.00\n",
       "2             Translation       0       $ 0.00      $ 0.00\n",
       "3                  Course       0       $ 0.00      $ 0.00\n",
       "4       Subscription Mapt     203     $ 626.05    $ 100.16\n",
       "5              Print Book     366  $ 12,927.57  $ 2,068.41\n",
       "6                   Ebook     643   $ 9,131.04  $ 1,460.97"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dfs = tabula.read_pdf('2018Q1.pdf', pages=1, \n",
    "                      relative_area=True,\n",
    "                      relative_columns=True,\n",
    "                      area=[42, 0, 60, 100], \n",
    "                      columns= [45, 65, 82])\n",
    "df = dfs[0]\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4c8e378f",
   "metadata": {},
   "source": [
    "## Convert to numeric\n",
    "\n",
    "This extraction looks much better and almost exactly what we want. Let's check the data types of each column to understand which ones need more processing."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "3a1e6bcb",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Format       object\n",
       "Copies        int64\n",
       "Revenue      object\n",
       "Royalties    object\n",
       "dtype: object"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9fcc5293",
   "metadata": {},
   "source": [
    "Conveniently, the Copies column was successfully converted to an integer. The Revenue and Royalties columns were read in as strings. We'll now have to turn to pandas to convert these columns to floats. We do so by replacing any dollar signs, commas, and spaces with empty spaces using a regular expression."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "6bee65d0",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Format</th>\n",
       "      <th>Copies</th>\n",
       "      <th>Revenue</th>\n",
       "      <th>Royalties</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Subscription 3rd Party</td>\n",
       "      <td>0</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Video</td>\n",
       "      <td>0</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Translation</td>\n",
       "      <td>0</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Course</td>\n",
       "      <td>0</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.00</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Subscription Mapt</td>\n",
       "      <td>203</td>\n",
       "      <td>626.05</td>\n",
       "      <td>100.16</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Print Book</td>\n",
       "      <td>366</td>\n",
       "      <td>12927.57</td>\n",
       "      <td>2068.41</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Ebook</td>\n",
       "      <td>643</td>\n",
       "      <td>9131.04</td>\n",
       "      <td>1460.97</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                   Format  Copies   Revenue  Royalties\n",
       "0  Subscription 3rd Party       0      0.00       0.00\n",
       "1                   Video       0      0.00       0.00\n",
       "2             Translation       0      0.00       0.00\n",
       "3                  Course       0      0.00       0.00\n",
       "4       Subscription Mapt     203    626.05     100.16\n",
       "5              Print Book     366  12927.57    2068.41\n",
       "6                   Ebook     643   9131.04    1460.97"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[['Revenue', 'Royalties']] = (df[['Revenue', 'Royalties']]\n",
    "                                  .replace('[$, ]', '', regex=True)\n",
    "                                  .astype('float64'))\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f33e9904",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "In this tutorial, we passed custom inputs into tabula-py's `read_pdf` function to extract a table trapped within a PDF as a pandas DataFrame. We then cleaned up the result with pandas ability to replace characters using a regular expression. Several [more options][1] are available to tabula-py to customize the extraction process.\n",
    "\n",
    "[1]: https://tabula-py.readthedocs.io/en/latest/tabula.html"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ee019af0",
   "metadata": {},
   "source": [
    "# Master Data Analysis with Python\n",
    "\n",
    "If you enjoyed this lesson and are looking to become an expert with Pandas, then check out my book, [Master Data Analysis with Python][1]. It is the most comprehensive Pandas book available, and comes with video lessons, 500+ exercises with solutions, and certification exams.\n",
    "\n",
    "[1]: https://dunderdata.com/master-data-analysis-with-python"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
